CAGEF_services_slide.png

Advanced Graphics and Data Visualization in R

Lecture 04: Expressing expression data


0.1.0 An overview of Advanced Graphics and Data Visualization in R

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the fourth in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will focus on standard visualizations of expression data. While our prior topics have focused on making visualizations that are applicable broadly to many types of data, a number of this week's visualizations are used mainly on multi-dimensional data from experiments like RNAseq.

At the end of this lecture you will have covered the following topics

  1. Visualizing flowcharts and workflows
  2. Scatterplot variants of RNAseq data
  3. Colour gradient visualizations aka heatmaps
  4. RNAseq data analysis with goseq
  5. Visualizing relationships between samples

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's datasets will focus on differential expression analysis. The basic visualizations of this data after it has already been processed by packages like DESeq2.

0.4.1 Dataset 1: Lecture04_sankey_data.csv

This is an example dataset used for building a Sankey diagram. It builds a theoretical workflow for RNAseq analysis and visualization within a manuscript. What is a Sankey diagram? We'll find out!

0.4.2 Dataset 2: Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv

RNA-Seq read count data generated from SARS-CoV2 infections of AEC cells. Used to compare the timecourse of expression (pro-inflammatory) changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151

0.4.3 Dataset 3: Blanco-Melo2020Cell.Supp1.xlsx

This is an RNAseq dataset comparison for infection of multiple cell types with pathogens like SARS-CoV-2, SARS-CoV-1, MERS-CoV, RSV (respiratory syncytial virus), IAV (influenza A virus) and HPIV3 (human parainfluenze virus type 3) published in Cell doi: 10.1016/j.cell.2020.04.026

0.4.4 Dataset 4: Blanco-Melo2020_Supp_Data_4.xlsx

This is an RNAseq dataset comparison for infection of human alveolar adenocarcinoma (A549) cells with and without influenza A virus (IAV) published on bioRxiv doi: https://doi.org/10.1101/2020.03.24.004655

0.4.5 Dataset 5: Lecture04.RData

A saved file with a final GO annotation dataset we look at with our visualizations of dot plots.


0.5.0 Packages used in this lesson

repr- a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

networkD3, gghighlight, ggrepel, and UpSetR are used in building some of our visualizations including flowcharts, and upset plots.

goseq, and org.Hs.eg.db are packages that will help us perform some basic Gene Ontology (GO) annotations with our data.


1.0.0 Working with large amounts of data

Until this point, the data sets we have used in lecture have been relatively simple epidemiological information. All of our observations are values like new cases or vaccinated individuals tied to dates or time periods. Whereas most of our observations were connected in some linear fashion the data we examine today will have tens of thousands of observations, each representing a different gene!

Given the complexity of our data, we will discuss ways to sort, partition, visualize and interpret it.


1.1.0 Where does RNAseq data come from?

Many of you may be familiar with the idea of RNAseq data. It begins in the real world (or at the bench) with individuals/groups/conditions compared against a control state. Biological samples can be collected in time series or at a single endpoint. Every experimental conditions should have a matching control or baseline profile for comparison, not to mention biological replicates!

After collecting your samples you still need to prepare and sequence your libraries, check your data quality, trim reads, map them to a transcriptome, estimate/normalize the expression of genes within each library and then compare the expression between libraries/samples to determine if there are transcriptional differences! All of these steps are beyond this class BUT if you plan on working with transcriptomes, you'll eventually encounter this process. Today we'll jump the line and work directly with differential expression (DE) data. That means we'll be at the bottom right corner of the following diagram AKA, the end of the pipeline.

RNAseq_1.png

Image from Bioinformatics Workbook: RNA Sequencing Analysis


1.2.0 Use a Sankey diagram to illustrate your workflow

Sankey diagrams are named after Irish Captain Matthew Henry Phineas Riall Sankey, who first presented this type of visualization in 1898 to convey the energy efficiency of a steam engine!

With larger sets of data across multiple experiments, it can sometimes be helpful to use a Sankey diagram to explain the connection between different aspects of your samples. The key attribute of the Sankey diagram is that the width of a connection (flow) is proportional to the quantity represented. So the Sankey diagram is a flow diagram that also visually quantifies the dynamics of the relationships in your diagram. Think of it much like a stacked bar chart that can split or join with other bars at each connection.

A useful package to generate Sankey diagrams is networkD3 using the sankeyNetwork() function. To generate a Sankey diagram, you need information on two variables with a third optional one:

We'll find a pre-made table to that describes a theoretical RNAseq workflow based in the datasets we'll be working with today. Some sections have been compressed for brevity but we'll find all of that information in ./data/Lecture04_sankey_data.csv. Let's begin by reading it in.


1.2.1 Convert your flow data frame into a node data frame

Although our data frame sankey_info.df contains the information about our connections, the networkD3 package requires some mapping of the nodes within our data. Each unique source and target will be counted as a node and we'll save that information into sankey_nodes.df.


1.2.2 Map your node names back to your source and target data with match()

Now that we have our nodes information stored basically as an identification number, we have to map it back to sankey_info.df. This is essentially like working with a factor that spans two columns instead of just one. To accomplish the mapping, we'll use the match() method. The match() function takes two values:

and returns a vector of length x where each integer value represents the position of its matched value from table. Unmatched values from x (ie not found in table) are set to nomatch.

match() vs "%in%" The match() function and the matching operator %in% are very similar in idea. Whereas the former produced a set of matched positions between sets, the latter produces a logical vector indicating inclusion (TRUE) or exclusion (FALSE) from the intersect of the sets.

Let's try with a quick example with our data


Note that our node information must be 0-indexed! R uses an indexing scheme that begins at 1 so we'll need to offset that information that we generate as well.


1.2.3 Build your Sankey diagram with sankeyNetwork()

Now that we have our two necessary data frames, we can build our Sankey diagram with the sankeyNetwork() function. This function has a few parameters that we'll want to set:

There are an additional number of parameters that allow you to play with the visualization of this diagram including:

For more information on these parameters and others, you can go here


1.3.0 Scatterplot matrices for some quick QA of your read counts

Stepping back one step in our flowchart, just prior to your DE analysis, you'll have a set of read counts generated from RNA-Seq data. The counts are an estimation of each transcript within your data. Some programs may include the analysis of spliced isoforms and others may not.

Often in your RNA-Seq datasets, you should have multiple replicates of your data. A quick way to assess the quality of your datasets is to generate a scatterplot matrix. The scatterplot matrix is an all-by-all assessment of your experiments that generates a scatterplot of read counts for each pair-wise dataset.

12859_2019_2968_Fig3_HTML.webp
A scatterplot matrix can help to quickly assess the quality and consistency of your data.
Figure taken from: Visualization methods for differential expression analysis. Rutter et al., 2019. BMC Bioinformatics 20: 458

1.3.1 Import your wide-format read count data.

We'll being by importing our data from Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. This tab-separated data file contains total RNA experiments from the infection of airway epithelial cells (AECs) under different conditions, treatments with inhibitors, and timepoints.


1.3.2 Prepare your data as a clean read count matrix

Before we proceed to generating our scatterplot matrix, we need to select our data from our main dataset. In particular, the data from Wyler et al., that we imported as wyler_readcounts.df contains data from infecting human airway epithelial cells (AECs) and treating with either DMSO or 200nM of the HSP90 inhibitor, 17-AAG. With each condition there is a control set of uninfected AECs. For each of these 4 conditions, there are 3 replicates and 3 timepoints (24-, 48-, and 72-hours). In total that amounts to 36 sets of data.

For our purposes, we'll:

  1. Compare 24h vs 72h infected+DMSO-treated found in columns 9-11 and 33-35.
  2. Filter our reads to the range of 10-5000 using the if_all() function.
  3. Rename our columns to remove some redunant data (AECII_xx_)

Note that the if_all() function is a rather new addition to the tidyverse and is used to help filter on a predicate across multiple columns. This is very much like the across() helper function we've used previously for summarizing data. Like the across() function, the if_all() helper function uses the following parameters:


1.3.3 Use ggpairs() to generate a scatterplot matrix

You may recall our working with the parallel coordinate plot from lecture 2. Rather than transform the data ourselves, we passed our data along to the ggparcoord() function from the GGally package. Well that package is here to simplify our lives again! Rather than generating the paired data between each experiment ourselves, we can pass along the read count matrix we just generated to the ggpairs() function. Parameters of the ggpairs() function that we'll use include:

Moreover, you may recall that GGally works with the ggplot or is ggplot-extensible so we can use many of the same features from ggplot to fix some of the aesthetics of the result object. In the end, you'll notice that this is basically a faceted scatterplot but the dataframe required to generate it is more complex than what we currently have. For more information on this function, you can check out the GGally manual or check our the list of great examples provided on the authors' github.


Work smarter, not harder: When it comes to certain complex visualizations, it is best to avoid re-inventing the wheel. There are so many packages available that accomplish a great number of different visualizations. The key is to knowing what kind of visualization you want to produce, and then checking if a package exists. While the ggplot package can accomplish quite a few basic and complexly layered visualizations, ones like the above scatterplot matrix require the generation and replacement of multiple plot types. This requires going to a deeper level of knowledge beyond this course. A nice package like GGally takes away those issues while offering the ability to still customize and alter your plot format to a certain extent.

1.3.4 Interpreting a scatterplot matrix

As you can see, the scatterplot matrix can provide a quick way to check that your replicates are similar, while your comparison conditions should show more variation between them. Looking at our above plot, the triangular sections of scatterplots in the top-left and lower-right represent the comparison of replicates in our experiment. These should produce scatter points that are close to the diagonal - meaning the replicates are quite similar in the data produced.

In the lower-left corner of the plot we find the comparison between our two conditions (24- vs 72-hour data) and you can see much more scatter away from the diagonal. If our conditions really do induce different transcriptional profiles, this is what we should expected to see.

Across the diagonal we see the density plot of our read counts. These aren't particularly helpful with this dataset as the data predominantly appears to have low read counts even after we filter for a minimum of 10 reads per gene.

This technique is also a useful way to identify potentially mislabeled samples before processing your data for analysis. Changes from the above pattern can signal the mislabeling of samples or changes to the process of sample preparation.


1.4.0 What does differential expression data look like?

Digging down into your DE results, with the human genome, you are looking to identify trends in gene expression differences across 43,000 potential transcripts sequences - only half of which produce proteins. You can imagine that opening up a tabular file of that size could take a bit of time.

For each DE experiment there are a number of values generated. In a popular package like DESeq2 you will typically find:

Variable Description
Gene name The names of your genes which may be in different formats like Entrez ID, Ensembl, or gene symbol
Base mean Average of the normalized counts values, accounting for size factors, taken over all samples (and or replicates)
Log$_{2}$ fold-change The effect size estimate - how much your gene's expression has changed from control/base conditions
Log fold-change SE An estimate of effect size uncertainty
p-value Hypothesis test against H$_{0}$ that no difference exists between sample groups
adjusted p-value An adjusted p-value after taking into account multiple testing between groups/samples

You can find more information on working with the DESeq2 package here


1.5.0 You can examine data on many scales

Once we have a set of data we can examine it at from many aspects:

  1. Compare all genes within a sample and across samples
  2. Compare DE in a subset of genes within a sample and across samples
  3. Compare DE results based on gene function

Let's explore these different aspects and the kind of plots we can generate to accomodate the size of our data set. First, however, we need to find a useful data set to work with.


1.5.1 Use the readxl library to open your .xlsx files

We've briefly touched on using this package in Lecture 01 to help read our Microsoft Excel-based data. Today's data set from Blanc-Melo et al. (Cell 2020), comes to us in the form of an excel file. Let's use some of the tools from this package to help open up our data file. We'll start by peeking at how many sheets there are using excel_sheets().


1.5.2 Open each sheet separately with read_excel()

Now that we can see there are two sheets in our data, we can assign each one to a different data frame using the read_excel() command which contains the parameter sheet. We can use this to determine which sheet to load either based on it's position (integer) or it's name as specified by excel_sheets().


1.5.3 ABW: Always Be Wrangling (your data)

Yes, it looks like the legend has some helpful information about the dataset itself found in the second sheet. The DE data looks like it's split across 20 columns encompassing 10 experimental data sets. For each data set, it looks like there is a Log$_{2}$ fold-change value (L2FC) and an adjusted p-value (padj).

Let's take a closer look at blanco_legend.df first. From it we can parse out some important information about the experiments themselves like the pathogen and host in each experimental set.


1.5.4 Extract substring patterns and save them using str_match_all()

You may remember that we can easily match for patterns in our string using functions from the stringr package but with a function like str_match_all() you can also include matching groups to your pattern that can be captured. For each string match, this function will return a matrix that contains the complete matching pattern and any grouped patterns that are denoted by the () parentheses.

If a string contains multiple matches to your pattern, it will generate additional rows in the matrix for that string.

Looking at our Notes section of blanco_legend.df, it looks like it follows a regular pattern were we can extract the pathogen and host cell information

section Examples
Pathogen name SARS-CoV-2, RSV
Intermediate strings infection, infection with Ruxolitinib
Host cell type A549, A549-ACE2

1.5.5 Convert our DE data table into long-format and join with our cell information

Now that we have catalogued our experimental information, let's build that into a long-format version of our actual DE dataset found in blanco_data.df. The steps we'll take to prepare our data are

  1. Convert from wide to long format with pivot_longer().
  2. Join with our experimental data information in blanco_exp_info.df
  3. Replace our experiment information (formerly column information) to a consistent format of experiment_value so we can split it out properly.
  4. Separate our experiment_value information. Currently we'll have experiments mapping to both an L2FC and padj value
  5. Convert our data back out a little bit to a slightly wider format with pivot_wider()

2.0.0 Visual inspection of DE across a sample

After all of our data wrangling, you can see we have a data frame with nearly 1/4 million observations. This spans across 10 experimental conditions. From a broad level, we can look across entire data sets to identify genes of interest. There are a couple of ways to do this and we'll begin with a standard set of plots that can convey the overall distribution of our data. While our view of the data is low-resolution we are still able to compare specific gene information centered around mean expression levels, magnitude of comparison, and statistical significance.


2.1.0 The MA plot describes the spread of our DE data

Used originally in the application of Microarray data analysis, these plots are also applied to visualizing high-throughput sequencing analysis.

This is a data transformation between two sets of data such that

We plot our M values along the y-axis against the corresponding A values on the x-axis. Do these characteristics sound familiar? They are standard output for DESeq2 data sets! Unfortunately our above data doesn't have that information anymore, but we have a dataset that does! It's another dataset from an earlier pre-publication copy of the Blanco-Melo et al., 2020 paper: Blanco-Melo2020_Supp_Data_4.xlsx.

Let's open this up and take a look!


2.1.1 ABmW: A Bit more Wrangling with across()

Okay, so we can't use the data right away. The read_excel() command has decided to convert some of our columns to character format because it detected NaN values. So we'll have to quickly coerce the data before moving on. We can quickly convert the data with a mutate() command using the helper verb across().

To successfully use this helper, we will want to list out the columns we want to use in the .cols parameter, and then apply a function like as.numeric. We don't actually need all of these columns, but it's good practice!


2.1.2 Build your MA plot using ggplot2

Given that we already have our data in a differential expression format, we can just use ggplot2 to plot the correct variables to the x and y axes. In this case, we are interested in using the log2FoldChange as our M and baseMean represents our A. There are a number of ways we can colour this plot but we'll take into account our padj values to highlight those genes with DE values that are statistically significant.

Now, if we were using raw or normalized expression data, we would need to calculate the proper values for the transformation of M and A. There are a few packages that can take care of this for you such as ggmaplot which you can find here

For our plot, to differentiate between our points, let's use the gghighlight package from last week! What will our predicates be based on? Log2 fold change? Adjusted p-value?


2.1.3 Interpreting our MA plot

So our plot shows us the general distribution of the DE data. Looking at it, we can see, notably, that it appears that we have many more genes upregulated than downregulated in comparison to our control. We also see, as expected, more low-probability variation occuring with lower normalized mean counts. Conversely, as our overall mean counts increase, we also see less variation in general but we do see a gene that does pass threshold DE with a low enough p-value! To investigate further you could filter the data or look again with a volcano plot!


2.2.0 Volcano plots examine fold-change vs p-value

While the MA plots explored data based on its fold-change values in relation to mean expression, the volcano plot looks purely at differential expression based on it's p-value (ie the probability that the DE is real). To emphasize the importance of the p-value we will -log transform it so smaller p-values are higher up on the y-axis. This quickly partitions data points in the visualization to draw high-DE/low p-value combinations away from the bulk of the population.

To plot this data, we'll use our larger blanco_data_long.df which contains data from multiple experiments. For now, let's focus on just the SARS-CoV-2 data that has been generated in the A549 host cells. Our -log transformation of the p-values will also be problematic when encountering 0, so be careful! Let's compare plots with and without these problem values.

To label our points of interest, we'll also use the geom_text_repel() from the ggrepel package. This will help arrange and plot text onto our scatterplot in a way that avoids collisions with the data and the other labels! You could also experiment with using the directlabels package that we played around with last week to accomplish a similar feat.


2.2.1 Interpreting your volcano plots

Looking at our volcano plots, we can clearly see the volcano shape we are looking for. The data is split between our $\pm$ log$_{2}$ fold changes with points highlighted when we see DE beyond this. You can customize your parameters but you can also see the emphasis on lower p-values in our data set. The most "significant" data is found in the upper left and right quadrants of the graph, which wasn't made clear in the MA plot. You could use the same tricks to label your MA plot too but the data could end up anywhere along the MA plot.

In the end, if you do have access to the original counts, it may be of use to consider this information when further investigating your candidate genes from DE.

Section 2.0.0 Comprehension Question: Comparing our two versions of the volcano plot, we see different sets of genes highlighted as the "top 10" based on adjusted p-values. What makes these two sets of data different? Why do we see datapoints at the top edge of our plots in the second version?

3.0.0 Medium resolution analysis focuses on groups of genes

Now that we've taken an overall look at our data, we can begin to assess our data based on some candidate genes or hypothesis testing. For instance, you may only be interested in looking at genes with an L2FC > 3. In other cases, you may be interested in a group of genes pertaining to a function like the chemokines which are part of the inflammation response.

After selecting a subset of genes we can begin to compare their DE data more meaningfully across different sample sets. Let's continue working with our long-format dataset blanco_data_long.df which contains our multiple RNAseq experiments from various infection conditions. We will begin by generating a list of the highest DE genes with low p-values in the SARS-CoV-2 infection scenario.


3.1.0 Use heatmaps to visualize data across a continuous range of values

Now that we have generated a select group of genes to examine, our DE data is well-suited to visualizing in heatmap or heat plot. In our heatplot we can generate values along two categorical axes. On the y-axis we can plot DE values from our individual genes, while on the x-axis we will show values across multiple experiments. To visualize the data itself, we will use the geom_tile() command.


3.1.1 Interpreting a heatmap of upper values

From our list of the top 18 DE hits in our SARS-CoV-2 set, it looks like we see some similar hits across some of the genes when infecting A549 cells with other viruses like HPIV3 and RSV.

We see strong L2FC values for IL6, IL1A and LAMC2 for instance. IL6 (interleukin 6) has been reported to play a role in both mounting an effective immune response to certain viral infections but its interactions with other factors may also have a potential role in the exacerbation of viral phenotypes. In looking across different host cell infections by SARS-CoV-2 there is again consistent upregulation of IL1A and IL6.

Using a similar method, we can instead focus on genes from a specific family or pathway as long as we have a list. Let's try the family of chemokines which play a role in inducing cell movement during the immune reponse.


3.1.2 Heatmap interpretation

From our heatmap, we can observe that looking only at the chemokines, there is consistent upregulation of CCL5 and CCL2 when infecting our A459 cells with any of these viruses. In comparison, however, HPIV3 and RSV infection show a much higher DE across the CCL family in comparison to SARS-CoV-2 which appears to illicit less inflamatory response in the time-frame sampled.

But wait there's more! While we are dealing with just a subset of our RNA-Seq data, much more is available to visualize on a greater scale. Next week we'll dig deeper into high-dimensional data and how heatmaps can be used to visualize and organize this kind of information to help identify and convey patterns.

3.2.0 Looking at single genes with dot plots

We can further narrow our search instead of looking at whole gene groups, and pick a small set of specific genes. We'll use a dotplot to visualize the data where the y-axis uses the L2FC value to place points and we'll use a categorical x-axis of gene lists. We'll filter our data to only include infections in the A549 host, and also group our data in a few ways:

  1. We'll scale our point sizes based on the adjusted p-value to get a better sense of how probable these changes are.
  2. We'll use colour to define the pathogen used in each experiment.
  3. We'll also facet our data based on pathogen.

3.2.1 Dot plot interpretation

Focusing on small sets of genes we can see some trends across our groups. For instance, 3 of our pathogens produce a strong IL6 response as we've seen in our heatmaps. , we see a >2 L2FC in CCL2 as a response to infection by SARS-CoV-2. Actually our weakest IL6 response to SARS-CoV-2 was seen in the ACE-2-expressing A459 hosts so our initial heatmap didn't give us a full picture of what might be happening.

It would be best to further investigate these on a more specific subset of hosts and/or pathogens.


4.0.0 RNAseq analysis for significant functional terms

Up until this point, we've been feeling our way through the data, rather undirected. Often within manuscripts you will find deeper analyses of differential expression by examining the collection of functional terms to determine trends or pathways of interest. The Gene Ontology (GO) resource represents a codification of gene function through three domains: cellular component (CC), molecular function (MF), and biological process (BP).

Most, if not all, of the genes in our dataset can be described in some way by a GO term. Once we collect these terms, we can begin to look for meaningful groups that could be considered over-represented (or under-represented) in our dataset. Once we have collected this information we can visualize it similar to our individual gene dotplots.


4.1.0 Use the goseq package to pull down GO terms for your genes

Using the goseq package we can perform a full GO term analysis on our DE data with an analysis for enriched terms within our set of over- or underexpressed genes. However, in order to begin the process of pulling down go terms, with the goseq package, we also need to generate a named integer vector to split our gene sets into the two relevant categories.

We'll store our categorized set of genes in genes_SARS_CoV_2 where a 1 represents over/underexpressed genes in our set, and 0 represents all remaining genes.


4.1.1 Choose the correct database to map your gene symbols to GO terms

The goseq package is compatible with a number of organisms and gene names for mapping your data to the GO database. To view which organisms are supported, we can use supportedOrganisms() to view. Since the raw RNA-Seq reads were aligned to the human genome using the hg19 assembly (see Blanco-Melo et al., 2020), we'll be working with hg19 as well. We'll use the annotation information provided for the next part of our analysis.

First, however, let's check that hg19 is indeed a supported genome annotation.


4.1.2 Fitting a probability weighting function (PWF) to our dataset

Looking at our table above we can immediately see that there are 3 potential annotations sets to draw from. Each uses a different kind of ID description: Entrez Gene ID, Ensemble gene ID, and Gene Symbol. Which one do we use? This depends on how our data is formatted but let's take a quick look at this screenshot of CCL8 entry from NCBI:

GeneInformation.png
It's important to know the difference between your gene symbols and various identifiers!

Knowing what we've seen of the data already, the information we're looking for is in the third row. Using the gene symbols in our data, we can

  1. Map our gene symbols to GO annotations, and
  2. Gather the gene lengths for our data.

The second portion of information is required in the generation of a probability weighting function (PWF). This PWF will determine a weighting for each gene which is essentially the probability of a gene being differentially expressed based on its length alone. It's important to generate with information as it can influence our enrichment analysis. Ultimately we use the PWF to help inform the creation of a null distribution in our analysis.

We'll use the nullp() function to generate the PWF for our dataset.


Looking at a plot of our PWF: here we see a subset of genes and their PWF values to see how they fit according to our line of best fit. The goal is for the points to be relatively close to our line of fit. If the values are far off, there may be an issue with the annotation data or how you calculated your gene lengths for the PWF.

4.1.3 Generate over and under-expressed GO terms from our DE data

We're now ready to query the GO database for over- and underexpressed terms in our data. We'll use the goseq() command to do this, providing our PWF object as well as the correct database information. The default method used in this process to generate our enrichment analysis is the Wallenius approximation. You can choose to use random sampling but this method can take much longer with little difference in results.

For more information on this process see the goseq package vignette here


4.2.0 Interpreting our goseq results

Looking at the results of our analysis, we are returned a listing of 22,531 unique GO terms across the 3 domains (CC, BP, and MF). We are given two sets of p-values, one each for over-represented and under-represented terms. With each term we also see how many times that term is represented in our DE data vs the total number of times that term may appear in the GO database we queried. As you can see from the first few rows of data, some GO term categories can encompass a large number of genes.

Remember how we created this data! Let's also remind ourselves that we set up our DE gene list using the data from A459 cells infected with SARS-CoV-2. Looking at the manuscript itself, while having a distinct immune response, the replication of the virus was very low in these cell types which lack the expression of the ACE2 receptor. We're really just using this data to make some visualizations!

To begin our analysis we can filter the enriched GO terms sets by applying a multiple hypothesis test correction with p.adjust() using the Benjamini-Hochberg method to minimize the false discovery rate.


4.2.1 Plot our enriched GO term categories as a dot plot

Looking at our enriched set of GO terms, we see terms like inflammatory response and response to stress. Now that we have a set of enriched GO terms, we can filter and plot this information as a dotplot with the following characteristics:

  1. GO terms listed along the y-axis
  2. size of dots representing category size
  3. use colour to represent the % of DE genes present from that that category

Let's look at terms that include the word "response".


4.2.2 Plot multiple conditions to the same dot plot

With a little bit of elbow grease, we can generate a data frame with multiple GO enrichment analyses for multiple experiments and now we can compare enrichment across data sets!


4.3.0 Upset plots

4.3.1 Can Upset plots help us to make sense of overlapping relationships?

Upset plots are an alternative to Venn diagrams that show the intersection of sets, as well as the size of the sets. Additionally, Venn diagrams can be difficult to interpret for greater than 2 or 3 sets. This is a real life figure from BMC Bioinformatics. Sure it looks pretty, but what does the number 24 represent in this picture in terms of A, B, C, D, and E?

Fig2_Lam_BMCBioinformatics2016_17.png
What is the meaning of the value 24 in this diagram? Stare at it long enough and you'll see which group it's in but imagine this was 10 groups?


The UpSet plot was first published in 2014 and has become a helpful tool for visualizing the intersection between components with datasets. Along with the publication came a package for working with these plots. Known as the UpSetR package, there have since been more projects implementing this kind of visualization. While UpSetR has not been updated in nearly 3 years, a more active package known as ComplexUpset is available.

To some extent, the ComplexUpset has extensibility with ggplot, allowing you to use somewhat familiar syntax to modify these plots. Add to this the ability to stack or include additional visualizations of your datasets distributions or other characteristics, and this is a pretty attractive package to work with. </br>


4.3.2 Working with the ComplexUpset package to visualize overlapping datasets

Let's see how UpSet plots work practically. We can compare our data from blanco_data_long.df and compare the overlap of DE genes (after filtering or not). To do this in a practical sense, we again need to convert our values to a boolean representation after some more data wrangling. Our data wrangling steps will include:

  1. Generating a short-list of genes with a high log2 fold-change and low p-value.
  2. Filtering our data to include only those genes.
  3. Converting our data to a wide format where each column represents the inclusion status of a gene (observation) for a specific experiment.

Let's look at our data structure again.


Begin by making our short-list of genes to filter by.


Filter our blanco_data_long.df and pivot to a wide-format containing our logical values for inclusion.


4.3.3 Generating our upset data

As you can see from our data transformations, we now have all of our experiments listed in columns, with each gene represented in each row. As we look down each column we see a value of TRUE or FALSE used to differentiate if there was overexpression of that gene in that specific experimental context.

If we had not filtered based on some set of genes, we would have a data frame with more than 23K columns! Now that we have our genes presented in this way we can proceed to generate an upset plot.

Working with the upset() function to build our plot, we will concern ourselves with the following parameters:


4.3.4 Interpreting our Upset plot

From our upset plot, we can see there are 3 regions.

  1. The left-hand barplot denotes the total number of observations in each set/category.
  2. The bottom plot graphically represents the different combinations of each category up.
  3. The upper barplot displays the number of occurences for the combination displayed in the bottom plot.

From our set sizes, it looks like a L2FC value of > 2 was perhaps too stringent for the IAV and MERS-CoV DE data. Since we did filter our genes initially by the best hits in our SARS-COV-2(A549) dataset, we can see that it has the highest frequency group at 31 genes with over expression in just this set. The next largest overlapping set is of 14 genes between the SARS-CoV-2(A549) set and the RSV(A549) DE data.

You can also see from the plot that there are only 9 combinations of datasets with any real overlap. Had we set min_size = 1 instead, the remainder of the grouping combinations would be seen to contain just a single gene in each.

Overall this can make for a simplified interpretation of overlapping genes versus more complicated Venn diagrams!


5.0.0 Class summary

Today we took a long look at some popular visualizations for expression data sets generated by an experiment like RNAseq. Our analyses focused more on highlighting aspects of the differential expression information itself from a big-picture low-resolution view like volcano plots to zooming down to the individual gene level with dot plots.

Next week we'll look at examples of "big" data from an even broader sense by using dimensional reduction techniques like classifying our datasets into groups with principle component analysis.


5.1.0 Weekly assignment

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown
Code 50% - Does it follow best practices?
- Does it make good use of available packages?
- Was data prepared properly
Answers and Output 50% - Is output based on the correct dataset?
- Are groupings appropriate
- Are correct titles/axes/legends correct?
- Is interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.

You can save and download the Jupyter notebook in its native format. Submit this file to the the appropriate assignment section by 12pm on the date of our next class: April 7th, 2022.


5.2.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.3.0 References

Sankey diagram documentation: https://www.rdocumentation.org/packages/networkD3/versions/0.4/topics/sankeyNetwork

riverplot, another package for making your Sankey diagrams: https://cran.r-project.org/web/packages/riverplot/riverplot.pdf

MA plots directly from two sets of expression data: https://rpkgs.datanovia.com/ggpubr/reference/ggmaplot.html

Generating goseq data for RNA analysis: https://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf

CAGEF_new.png